feat(http): add error handling for exporting #4709
Conversation
Force-pushed from d3d4031 to afb76f2
try:
    resp = self._export(serialized_data, deadline_sec - time())
except Exception as error:
Would you mind opening an issue or pointing to an existing issue for the behavior you're encountering? We can discuss possible fixes there.
I think there is a bug in _export() on L157, where the code assumes the connection will succeed, so the retry loop here is never executed. However, the raised exception should ultimately be caught here:
opentelemetry-python/opentelemetry-sdk/src/opentelemetry/sdk/_shared_internal/__init__.py
Lines 167 to 170 in d327927

except Exception:  # pylint: disable=broad-exception-caught
    self._logger.exception(
        "Exception while exporting %s.", self._exporting
    )
If you just want the logs to go away, you can set a filter on the opentelemetry.sdk._shared_internal logger or disable it
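To illustrate that workaround, a minimal sketch using the standard logging module; the logger name comes from the module path quoted above, and _DropExportTracebacks is just an illustrative name:

import logging

# Option 1: drop everything from the SDK's internal exporter logger.
logging.getLogger("opentelemetry.sdk._shared_internal").disabled = True

# Option 2: keep the logger but filter out only the export stack traces,
# matching the "Exception while exporting %s." message shown above.
class _DropExportTracebacks(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        return not record.getMessage().startswith("Exception while exporting")

logging.getLogger("opentelemetry.sdk._shared_internal").addFilter(_DropExportTracebacks())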
Yeah, the issue for me was that the logger wrote the entire error trace to stdout, which totally makes sense given the code you shared.
My goal was to still get notified when something fails rather than just disabling the logger, as I feel the OTLP exporters should handle this internally.
On top of that, I'd like to comment more generally that handling the retry via a try/except around _export() feels kind of odd to me.
Thanks for taking the time to look into this! :)
Do you still think I should open an issue for that generally?
@aabmass, from what I understand, this basically happens when the HTTP endpoint is not available and the urllib call is unable to complete the request, which then prints the full urllib stacktrace with the connection-refused error.
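For context, a minimal repro sketch of that situation, assuming no collector is listening on the configured endpoint; flushing a batch then logs the full connection-refused traceback through the SDK's internal logger:

from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# Point the exporter at a port with nothing listening behind it.
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces"))
)

tracer = provider.get_tracer(__name__)
with tracer.start_as_current_span("demo"):
    pass

provider.shutdown()  # forces a flush; the failed export is logged here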
I think an appropriate fix could be, in export(), to put the retry logic under a try/except, like:
try:
    resp = self._export(serialized_data, remaining_time)
except Exception as exc:
    is_retryable = True
else:
    is_retryable = _is_retryable(resp)
and use that variable in the if at L183. It maybe needs more adjustments. wdyt @aabmass @DylanRussell?
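To make that concrete, a hypothetical sketch of how the flag could feed the existing check (retry_num, _MAX_RETRYS, backoff, remaining_time, self._shutdown and _logger are assumed from the surrounding retry loop):

if (
    not is_retryable
    or retry_num + 1 == _MAX_RETRYS
    or backoff > remaining_time
    or self._shutdown
):
    _logger.error("Failed to export batch, giving up.")
    return SpanExportResult.FAILURE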
yeah makes sense to me @pafi-code
So just to be clear: you want me to implement the changes so that the retry loop is also taken when the request is not successful?
@pafi-code, if you are open to doing that, it would be good; if not, please lmk so I can file a PR. But I believe the correct approach is to fix the bug of the request not getting into the retry loop.
PR is updated. Two questions:
- Are you fine with me catching Exception in general, or should I only handle ConnectionError? I assume ConnectionError is preferable (see the sketch after this list for the relevant exception hierarchy).
- Are you fine with the log message in case of a timeout or max retries?
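For context, a small sketch of the relevant requests exception hierarchy; ConnectionError and Timeout are both subclasses of RequestException, which is why catching the latter covers both the refused-connection and timeout cases (the endpoint URL is just a placeholder):

import requests

try:
    requests.post("http://localhost:4318/v1/traces", data=b"", timeout=5)
except requests.exceptions.ConnectionError:
    pass  # endpoint unreachable / connection refused
except requests.exceptions.Timeout:
    pass  # request timed out
except requests.exceptions.RequestException:
    pass  # any other request-level failure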
- @pafi-code let's catch requests.exceptions.RequestException, wdyt?
- My first thought on this fix was:
try:
    resp = self._export(serialized_data, remaining_time)
    if resp.ok:
        return SpanExportResult.SUCCESS
except requests.exceptions.RequestException as exc:
    reason = str(exc)
    retryable = True
    status_code = None
else:
    retryable = _is_retryable(resp)
    status_code = resp.status_code
    reason = resp.reason
# Compute jittered exponential backoff
backoff = (2 ** retry_num) * random.uniform(0.8, 1.2)
remaining_time = deadline - time()
if (
    not retryable
    or retry_num + 1 == _MAX_RETRYS
    or backoff > remaining_time
    or self._shutdown
):
    if status_code is None:
        _logger.error("Failed to export span batch: %s", reason)
    else:
        _logger.error(
            "Failed to export span batch code: %s, reason: %s",
            status_code,
            reason,
        )
    return SpanExportResult.FAILURE
_logger.warning(
    "Transient error %s encountered while exporting span batch, retrying in %.2fs.",
    reason,
    backoff,
)
if is_shutdown_in_progress.wait(backoff):
    _logger.warning("Shutdown in progress, aborting retry.")
    break
return SpanExportResult.FAILURE
Okay, I know the pipeline is failing and I assume this PR should be rebased, but for me the question remains whether my proposed fix is actually how this should be handled. I would appreciate some more feedback here. Also, I've created an issue related to this PR: #4712
Force-pushed from afb76f2 to f581f22
@pafi-code can we have tests for this?
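For illustration, a hypothetical test sketch in that direction, assuming the fix makes export() return FAILURE instead of letting the requests exception escape; the endpoint, session and timeout constructor arguments are taken from the published HTTP exporter, and the short timeout bounds the retry backoff so the test stays fast:

from unittest import mock

import requests
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace.export import SpanExportResult


def test_export_returns_failure_when_connection_refused():
    # The mocked session stands in for the exporter's internal requests session
    # and simulates an unreachable collector.
    session = mock.Mock()
    session.post.side_effect = requests.exceptions.ConnectionError("connection refused")
    exporter = OTLPSpanExporter(
        endpoint="http://localhost:4318/v1/traces",
        session=session,
        timeout=1,
    )
    # With the fix, the exception is caught and reported as a failed export
    # instead of propagating (and being logged as a full traceback upstream).
    assert exporter.export([]) is SpanExportResult.FAILURE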
Description
When using the OTEL HTTP exporters, stdout gets polluted by unhandled error traces. I don't think this should be the desired behavior. It happens when the endpoint is not available while exporting telemetry; this can be caused by different events, but in general I think it should be handled better.
Therefore I added error handling with more specific and shorter error messages.
Type of change
I'm not sure whether the change type is a fix or a feature, as the service was still running.
How Has This Been Tested?
Does This PR Require a Contrib Repo Change?
Checklist: